Hi, welcome everybody to deep learning. Thanks for tuning in. Today's topic will be the
backpropagation algorithm. So you may be interested in how we actually compute these derivatives
in complex neural networks. So let's look at a simple example and this simple example here is
that we want to evaluate the following function. So our function is f(x1, x2) = (2 x1 + 3 x2)^2 + 3,
and we want to evaluate the partial derivative of f with respect to x1 at the position (1, 3).
There are two approaches that can do that quite efficiently: the first one
will be finite differences, the second one is the analytic derivative. So we will go through both
of them here. Now for finite differences the idea is that you compute the function value at
some position x plus a very small increment h, you also compute the original
function value f of x, you take the difference between the two, and then you divide by the
same value h. This is essentially the definition of the derivative: the limit of f at x
plus h minus f of x, divided by h, as we let h approach zero. Now the problem is that this is not
symmetric, so sometimes we prefer a symmetric definition. Instead of evaluating
at x and at x plus h, we go half an h back and half an h to the front. This centers
the approximation exactly at the position x, and we still divide by h. So this would be the
symmetric (central) definition.
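Written out as formulas, the two variants just described (with a small finite step size h) are:

```latex
% One-sided (forward) finite difference:
\frac{\partial f}{\partial x}(x) \approx \frac{f(x + h) - f(x)}{h}

% Symmetric (central) finite difference:
\frac{\partial f}{\partial x}(x) \approx \frac{f\!\left(x + \tfrac{h}{2}\right) - f\!\left(x - \tfrac{h}{2}\right)}{h}
```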
Now if we do that, we can apply it to our example. So let's try to evaluate this:
we take our definition f(x1, x2) = (2 x1 + 3 x2)^2 + 3 and we look at the
position (1, 3). Let's just calculate this using the plus-half-h definition from above. We
set h to a small value, say 2 times 10 to the power of minus 2,
and we plug it in. The first term is going to be (2 times (1 plus half of our h) plus 9) squared,
plus 3, where the 9 is 3 times x2 = 3 times 3. In the second term we of course have to
subtract half of the small value instead, and we divide the difference by h. This then lets us compute the
following numbers: we end up with approximately (124.4404 minus 123.5604) divided by 0.02, which
is approximately 43.999, so essentially 44. So we can compute this for any function, even if we don't know the
definition of the function. If we only have it as a module that we cannot access, we can use finite
differences to approximate the partial derivative. In practical use we take h in the range of
1 times 10 to the minus 5, which is appropriate for floating point precision. Depending
on the precision of your compute system you can also determine what the appropriate value for h
is going to be; you can check that in reference number 7. We see that this is really easy to use.
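As a quick illustration, here is a minimal Python sketch (not from the lecture; the name central_difference is my own choice) that reproduces this calculation with the symmetric definition:

```python
def f(x1, x2):
    # The example function from the lecture: f(x1, x2) = (2*x1 + 3*x2)^2 + 3
    return (2 * x1 + 3 * x2) ** 2 + 3

def central_difference(func, x1, x2, h=1e-5):
    # Symmetric finite difference with respect to x1:
    # (f(x1 + h/2, x2) - f(x1 - h/2, x2)) / h
    return (func(x1 + h / 2, x2) - func(x1 - h / 2, x2)) / h

# Reproduce the worked example with h = 2e-2 ...
print(central_difference(f, 1.0, 3.0, h=2e-2))  # approximately 44
# ... and with the h suggested for floating point precision.
print(central_difference(f, 1.0, 3.0, h=1e-5))  # approximately 44
```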
We can evaluate this on any function, and we don't need to know its formal definition,
but of course it is computationally very inefficient. Imagine you want to determine
the gradient, that is, the vector of all partial derivatives, of a function whose input has dimension
100. With the one-sided definition this means that you have to evaluate the function 101 times in order
to compute the entire gradient, and the symmetric definition even needs 200 evaluations. So this may not
be such a great choice for general optimization because it becomes
very inefficient, but of course it is a very cool method to check your implementation.
Imagine you implemented the analytic version and made a mistake somewhere; then you can use this
as a trick to check whether your analytic derivative is correctly implemented.
This is also something you will learn in the exercises here, and it is really useful when you
evaluate such derivatives in practice.
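A minimal sketch of how such a gradient check could look, assuming NumPy and my own helper names numerical_gradient and analytic_gradient (this is an illustration, not code from the lecture or the exercises):

```python
import numpy as np

def f(x):
    # Same example function, now with a vector input x = (x1, x2).
    return (2 * x[0] + 3 * x[1]) ** 2 + 3

def analytic_gradient(x):
    # Hand-derived with the chain rule:
    # df/dx1 = 2*(2*x1 + 3*x2)*2,  df/dx2 = 2*(2*x1 + 3*x2)*3
    inner = 2 * x[0] + 3 * x[1]
    return np.array([4 * inner, 6 * inner])

def numerical_gradient(func, x, h=1e-5):
    # Symmetric finite differences: one pair of function evaluations per input dimension.
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h / 2
        grad[i] = (func(x + step) - func(x - step)) / h
    return grad

x = np.array([1.0, 3.0])
print(analytic_gradient(x))      # [44. 66.]
print(numerical_gradient(f, x))  # approximately [44. 66.]
print(np.allclose(analytic_gradient(x), numerical_gradient(f, x)))  # True if the implementation matches
```

The check compares all partial derivatives at once; if one entry disagrees, the corresponding part of the analytic derivation is likely wrong.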
Now the analytic gradient we can derive by using a set of analytic
differentiation rules. The first rule is that the derivative of a constant is 0.
Then, the derivative is a linear operator, which means we can rearrange it: if you have, for example,
a sum of different components, you can differentiate each component separately. We also know the
derivatives of monomials: if you have some
x to the power of n, then the derivative is going to be n times x to the power of n minus 1.
And finally the chain rule: if you have nested functions, and the chain rule is essentially
the important ingredient that we also need for the backpropagation algorithm, then the
derivative with respect to x of a nested function f of g of x is going to be the derivative of f
with respect to g, multiplied by the derivative of g with respect to x. These rules are
summarized below.
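As a compact restatement of these four rules (my own notation, consistent with the description above):

```latex
\frac{d}{dx}\, c = 0
\qquad
\frac{d}{dx}\,\bigl(a\,u(x) + b\,v(x)\bigr) = a\,\frac{du}{dx} + b\,\frac{dv}{dx}
\qquad
\frac{d}{dx}\, x^n = n\,x^{n-1}
\qquad
\frac{d}{dx}\, f\bigl(g(x)\bigr) = \frac{df}{dg}\cdot\frac{dg}{dx}
```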
Okay, so let's place those at the very top right here; we will need them on the
next couple of slides, and let's try to calculate this. So here you see the partial derivative
with respect to x1 of f at (1, 3). Then we can just plug in the definitions, so this is going to be